ABanditNAS: Anti-Bandit for Neural Architecture Search
Finally, after K ∗ T samples, where T is a hyperparameter, we calculate the confidence
with the UCB according to Eq. 4.8 as
\[
s_U\big(o_k^{(i,j)}\big) = m_{k,t}^{(i,j)} + \sqrt{\frac{2\log N}{n_{k,t}^{(i,j)}}}.
\tag{4.12}
\]
The operation with the minimal UCB on every edge is abandoned. This means that operations
that are given more opportunities but result in poor performance are removed. With this
pruning strategy, the search space is significantly reduced from $|\Omega^{(i,j)}|^{10\times 6}$ to
$(|\Omega^{(i,j)}| - 1)^{10\times 6}$, and the reduced space becomes
\[
\Omega^{(i,j)} \leftarrow \Omega^{(i,j)} - \Big\{\arg\min_{o_k^{(i,j)}} s_U\big(o_k^{(i,j)}\big)\Big\}, \quad \forall (i,j).
\tag{4.13}
\]
The reduction procedure is repeated until the optimal structure is obtained, where only one
operation is left on each edge.
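As an illustration, the following NumPy sketch computes the UCB scores of Eq. 4.12 for the candidate operations on a single edge and abandons the one with the minimal score, as in Eq. 4.13; the operation names and the statistics $m_{k,t}$, $n_{k,t}$ are made-up placeholders rather than values from the actual search:

# A minimal sketch of the pruning step in Eqs. (4.12)-(4.13); the per-edge
# statistics below are illustrative placeholders, not ABanditNAS code.
import numpy as np

def ucb_scores(m, n, N):
    """Upper confidence bounds s_U(o_k) = m_k + sqrt(2 * log(N) / n_k)."""
    return m + np.sqrt(2.0 * np.log(N) / n)

def prune_edge(ops, m, n, N):
    """Abandon the operation with the minimal UCB on one edge (Eq. 4.13)."""
    scores = ucb_scores(np.asarray(m, dtype=float), np.asarray(n, dtype=float), N)
    worst = int(np.argmin(scores))          # arg min_k s_U(o_k)
    return [o for i, o in enumerate(ops) if i != worst]

# Example: one edge with K = 4 candidate operations after K * T samples.
ops = ["skip", "sep_conv_3x3", "max_pool_3x3", "dil_conv_5x5"]
m   = [0.52, 0.61, 0.44, 0.58]              # m_{k,t}: running mean performance
n   = [12, 15, 10, 13]                      # n_{k,t}: times each op was sampled
N   = sum(n)                                # total number of samples on this edge
print(prune_edge(ops, m, n, N))             # the op with minimal UCB is removed

Applying prune_edge to every edge (i, j) implements one reduction round of Eq. 4.13.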
Complexity Analysis. There are $O(K^{|E_M| \times v})$ combinations in the search space
discovery process with $v$ types of different cells. In contrast, ABanditNAS reduces the
search space every $K*T$ epochs. Therefore, the complexity of the proposed method is
\[
O\Big(T \times \sum_{k=2}^{K} k\Big) = O(TK^2).
\tag{4.14}
\]
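The quadratic bound can be read off from the arithmetic series over the shrinking number of candidate operations:
\[
T \sum_{k=2}^{K} k \;=\; T\left(\frac{K(K+1)}{2} - 1\right) \;=\; O(TK^{2}).
\]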
4.2.4 Adversarial Optimization
The goal of adversarial training [167] is to learn networks that are robust to adversarial
attacks. Given a network $f_\theta$ parameterized by $\theta$, a dataset $(x_e, y_e)$, a loss function $l$ and
a threat model $\Delta$, the learning problem can be formulated as the optimization problem
\[
\min_\theta \sum_e \max_{\delta \in \Delta} l\big(f_\theta(x_e + \delta), y_e\big),
\]
where $\delta$ is the adversarial perturbation. In this
chapter, we consider the typical $l_\infty$ threat model [167], $\Delta = \{\delta : \|\delta\|_\infty \le \epsilon\}$ for some $\epsilon > 0$.
Here, $\|\cdot\|_\infty$ is the $l_\infty$-norm distance metric and $\epsilon$ is the adversarial manipulation budget.
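To make the inner maximization concrete, the following PyTorch sketch approximates $\max_{\delta \in \Delta} l\big(f_\theta(x+\delta), y\big)$ with a few projected signed-gradient (PGD) steps; the model, the cross-entropy loss, and the values of $\epsilon$ and $\alpha$ are illustrative placeholders rather than the settings used in this chapter:

# A hedged sketch of the inner maximization over Delta = {delta : ||delta||_inf <= eps}.
import torch
import torch.nn.functional as F

def pgd_perturbation(model, x, y, eps=8/255, alpha=2/255, steps=7):
    """Approximate argmax_{||delta||_inf <= eps} l(f_theta(x + delta), y) with PGD."""
    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        loss = F.cross_entropy(model(x + delta), y)
        grad = torch.autograd.grad(loss, delta)[0]
        # Ascent step on the loss, then projection back onto the l_inf ball Delta.
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return delta.detach()

Taking a single step of size $\alpha = \epsilon$ from $\delta = 0$ reduces this to the FGSM approximation discussed next.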
The adversarial training procedure uses attacks to approximate the inner maximization over
$\Delta$, followed by some variation of gradient descent on the model parameters $\theta$. For example,
one of the earliest versions of adversarial training uses the Fast Gradient Sign Method
(FGSM) [75] to approximate the inner maximization. This can be seen as a relatively
inaccurate approximation of the inner maximization for $l_\infty$ perturbations, and it has the closed-form
solution $\delta = \epsilon \cdot \mathrm{sign}\big(\nabla_x l(f(x), y)\big)$. A better approximation of the inner maximization
is to take multiple smaller FGSM steps of size $\alpha$ instead. However, the number of gradient
computations caused by the multiple steps is proportional to $O(EF)$ in a single epoch, where
$E$ is the size of the dataset and $F$ is the number of steps taken by the PGD adversary.
This is $F$ times higher than standard training with $O(E)$ gradient computations per epoch,
so adversarial training is typically $F$ times slower. To accelerate adversarial training, we
combine FGSM with random initialization [247] in ABanditNAS. The resulting ABanditNAS
with adversarial training is summarized in Algorithm 8.
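As a rough illustration of this fast variant, the sketch below performs one training step of FGSM with random initialization; the optimizer, loss, and the values of $\epsilon$ and $\alpha$ are placeholder assumptions, not the exact ABanditNAS configuration:

# A minimal sketch of one fast adversarial-training step (FGSM + random start);
# model, optimizer, eps and alpha are illustrative placeholders.
import torch
import torch.nn.functional as F

def fgsm_rs_step(model, optimizer, x, y, eps=8/255, alpha=10/255):
    # Random initialization inside the l_inf ball, then one signed-gradient step.
    delta = torch.empty_like(x).uniform_(-eps, eps).requires_grad_(True)
    loss = F.cross_entropy(model(x + delta), y)
    grad = torch.autograd.grad(loss, delta)[0]
    delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach()
    # Outer minimization: a regular descent step on theta using the adversarial loss.
    optimizer.zero_grad()
    adv_loss = F.cross_entropy(model(x + delta), y)
    adv_loss.backward()
    optimizer.step()
    return adv_loss.item()

In practice the perturbed input $x + \delta$ is also clipped to the valid pixel range, which is omitted here for brevity.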
4.2.5 Analysis
Effect of the hyperparameter λ. The hyperparameter λ balances the contribution of past
and current performance. Different values of λ result in similar search costs. The
performance of the structures searched by ABanditNAS with different values of λ is used